[blog] Introducing inf2 runtime blog post
#540
Conversation
Looks Good.
[AWS Inferentia2](https://aws.amazon.com/en/ec2/instance-types/inf2/) (Inf2 for short) is the second-generation inference accelerator from AWS. Inf2 instances improve on Inf1 (originally launched in 2019) by delivering 3x higher compute performance, 4x larger total accelerator memory, up to 4x higher throughput, and up to 10x lower latency. Inf2 instances are the first inference-optimized instances in Amazon EC2 to support scale-out distributed inference with ultra-high-speed connectivity between accelerators.
Relative to the [AWS G5 instances](https://aws.amazon.com/ec2/instance-types/g5/) ([NVIDIA A10G](https://www.nvidia.com/en-us/data-center/products/a10-gpu/)), Inf2 instances promise up to 50% better performance-per-watt. Inf2 instances are ideal for applications such as natural language processing, recommender systems, image classification and recognition, speech recognition, and language translation that can take advantage of scale-out distributed inference.
should we quote the inf2 numbers here against say an A100 for reference?
Will do this once we have the profiling numbers to compare, I could only find this stat relative to G5.
## 📦 Deploying a model on Inferentia2 with NOS
Deploying models on AWS Inferentia2 chips presents a unique set of challenges, distinctly different from the experience with NVIDIA GPUs. This is primarily due to the lack of a mature toolchain for compiling, profiling, and deploying models onto these specialized ASICs. To effectively utilize the AWS Inferentia2 chips, custom model tracing and compilation are essential steps. This process demands a deep understanding of the deployment toolchain, including PyTorch IR op-support and the [AWS Neuron SDK](https://github.com/aws-neuron/aws-neuron-sdk), to fully optimize model performance. NOS aims to bridge this gap and streamline the deployment process, making it easier for developers to leverage AWS Inferentia2 for their inference workloads and expose easy-to-use gRPC/RESTful services.
[nit] distinctly different -> distinct. Also wouldn't say 'lack of mature toolchain'; maybe just point out that the Neuron SDK is very different from the Torch CUDA ecosystem.
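For readers unfamiliar with the tracing/compilation step described above, here is a minimal sketch (not part of the post) of compiling a Hugging Face embedding model for Inferentia2 with `torch-neuronx`. The sequence length, padding strategy, and output path are illustrative assumptions, and exact APIs may vary across Neuron SDK versions.

```python
# Minimal sketch: compile BAAI/bge-small-en-v1.5 for Inferentia2 with torch-neuronx.
# This is the manual workflow that NOS aims to abstract away; shapes and file paths
# below are illustrative assumptions, not values from the post.
import torch
import torch_neuronx
from transformers import AutoModel, AutoTokenizer

model_id = "BAAI/bge-small-en-v1.5"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id, torchscript=True).eval()

# Neuron compilation is shape-specialized, so trace with a fixed, padded input shape.
inputs = tokenizer(
    "hello world",
    padding="max_length",
    max_length=128,
    return_tensors="pt",
)
example = (inputs["input_ids"], inputs["attention_mask"])

# Trace and compile the model into a Neuron-optimized TorchScript artifact.
traced = torch_neuronx.trace(model, example)
traced.save("bge-small-en-v1.5.neuron.pt")
```

The saved artifact can then be loaded with `torch.jit.load` on an Inf2 instance and served behind the gRPC/RESTful endpoints that NOS exposes.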
| Model | Cloud Instance | Spot | Cost / hr | Cost / month | # of Req. / $ |
| ----- | -------------- | ---- | --------- | ------------ | ------------- |
| [BAAI/bge-small-en-v1.5](https://huggingface.co/BAAI/bge-small-en-v1.5) | `inf2.xlarge` | - | $0.75 | ~$540 | ~685K / $1 |
| **[BAAI/bge-small-en-v1.5](https://huggingface.co/BAAI/bge-small-en-v1.5)** | `inf2.xlarge` | ✅ | **$0.32** | **~$230** | ~1.6M / $1 |
💯
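As a sanity check on the requests-per-dollar column, a quick back-of-the-envelope calculation; the throughput used below is derived from the on-demand row of the table, not an independently measured number.

```python
# Back-of-the-envelope check of the table's requests-per-dollar figures.
# The throughput is implied by the on-demand row (~685K requests per $1 at $0.75/hr);
# it is a derived assumption, not a benchmark result.
on_demand_cost_per_hr = 0.75
spot_cost_per_hr = 0.32
on_demand_requests_per_dollar = 685_000

# Implied throughput in requests/sec (same hardware, so it carries over to spot).
throughput_rps = on_demand_requests_per_dollar * on_demand_cost_per_hr / 3600
print(f"implied throughput: ~{throughput_rps:.0f} req/s")        # ~143 req/s

# Requests per dollar at the spot price.
spot_requests_per_dollar = throughput_rps * 3600 / spot_cost_per_hr
print(f"spot: ~{spot_requests_per_dollar / 1e6:.1f}M req / $1")  # ~1.6M, matching the table
```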
Summary
Related issues
Checks
- `make lint`: I've run `make lint` to lint the changes in this PR.
- `make test`: I've made sure the tests (`make test-cpu` or `make test`) are passing.